-
Notifications
You must be signed in to change notification settings - Fork 232
Add --tmp-dir option for archive creation
#6946
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #6946 +/- ##
==========================================
+ Coverage 79.03% 79.03% +0.01%
==========================================
Files 566 566
Lines 43699 43721 +22
==========================================
+ Hits 34531 34552 +21
- Misses 9168 9169 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
c116bc1 to
0fd924d
Compare
777891b to
19c9dd3
Compare
| 'output_id': d.target_id, | ||
| 'label': d.link_label, | ||
| 'type': d.link_type, | ||
| try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Diff is large here because of the additional try-except and the indent that goes with it (to capture disk-space errors) but the actual code inside should be the same!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
git diff --ignore-all-space main src/aiida/tools/archive/create.py gives instead only:
diff --git a/src/aiida/tools/archive/create.py b/src/aiida/tools/archive/create.py
index 94ca88cd4..c4f0671d5 100644
--- a/src/aiida/tools/archive/create.py
+++ b/src/aiida/tools/archive/create.py
@@ -12,6 +12,7 @@ The archive is a subset of the provenance graph,
stored in a single file.
"""
+import os
import shutil
import tempfile
from datetime import datetime
@@ -59,6 +60,7 @@ def create_archive(
compression: int = 6,
test_run: bool = False,
backend: Optional[StorageBackend] = None,
+ tmp_dir: Optional[Union[str, Path]] = None,
**traversal_rules: bool,
) -> Path:
"""Export AiiDA data to an archive file.
@@ -139,6 +141,12 @@ def create_archive(
:param backend: the backend to export from. If not specified, the default backend is used.
+ :param tmp_dir: Directory to use for temporary files during archive creation.
+ If not specified, a temporary directory will be created in the same directory as the output file
+ with a '.aiida-export-' prefix. This parameter is useful when the output directory has limited
+ space or when you want to use a specific filesystem (e.g., faster storage) for temporary operations.
+ The directory must exist and be writable.
+
:param traversal_rules: graph traversal rules. See :const:`aiida.common.links.GraphTraversalRules`
what rule names are toggleable and what the defaults are.
@@ -280,10 +288,32 @@ def create_archive(
EXPORT_LOGGER.report(f'Creating archive with:\n{tabulate(count_summary)}')
+ # Handle temporary directory configuration
+ tmp_prefix = '.aiida-export-'
+ if tmp_dir is not None:
+ tmp_dir = Path(tmp_dir)
+ if not tmp_dir.exists():
+ EXPORT_LOGGER.warning(f"Specified temporary directory '{tmp_dir}' doesn't exist. Creating it.")
+ tmp_dir.mkdir(parents=True)
+ if not tmp_dir.is_dir():
+ msg = f"Specified temporary directory '{tmp_dir}' is not a directory"
+ raise ArchiveExportError(msg)
+ # Check if directory is writable
+ # Taken from: https://stackoverflow.com/a/2113511
+ if not os.access(tmp_dir, os.W_OK | os.X_OK):
+ msg = f"Specified temporary directory '{tmp_dir}' is not writable"
+ raise ArchiveExportError(msg)
+
+ else:
+ # Create temporary directory in the same folder as the output file
+ tmp_dir = filename.parent
+
# Create and open the archive for writing.
# We create in a temp dir then move to final place at end,
# so that the user cannot end up with a half written archive on errors
- with tempfile.TemporaryDirectory() as tmpdir:
+ try:
+ tmp_dir.mkdir(parents=True, exist_ok=True)
+ with tempfile.TemporaryDirectory(dir=tmp_dir, prefix=tmp_prefix) as tmpdir:
tmp_filename = Path(tmpdir) / 'export.zip'
with archive_format.open(tmp_filename, mode='x', compression=compression) as writer:
# add metadata
@@ -302,7 +332,9 @@ def create_archive(
}
)
# stream entity data to the archive
- with get_progress_reporter()(desc='Archiving database: ', total=sum(entity_counts.values())) as progress:
+ with get_progress_reporter()(
+ desc='Archiving database: ', total=sum(entity_counts.values())
+ ) as progress:
for etype, ids in entity_ids.items():
if etype == EntityTypes.NODE and strip_checkpoints:
@@ -359,7 +391,9 @@ def create_archive(
# stream node repository files to the archive
if entity_ids[EntityTypes.NODE]:
- _stream_repo_files(archive_format.key_format, writer, entity_ids[EntityTypes.NODE], backend, batch_size)
+ _stream_repo_files(
+ archive_format.key_format, writer, entity_ids[EntityTypes.NODE], backend, batch_size
+ )
EXPORT_LOGGER.report('Finalizing archive creation...')
@@ -368,6 +402,16 @@ def create_archive(
filename.parent.mkdir(parents=True, exist_ok=True)
shutil.move(tmp_filename, filename)
+ except OSError as e:
+ if e.errno == 28: # No space left on device
+ msg = (
+ f"Insufficient disk space in temporary directory '{tmp_dir}'. "
+ f'Consider using --tmp-dir to specify a location with more available space.'
+ )
+ raise ArchiveExportError(msg) from e
+
+ msg = f'Failed to create temporary directory: {e}'
+ raise ArchiveExportError(msg) from e
EXPORT_LOGGER.report('Archive created successfully')There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification @GeigerJ2! This'll help in my review <3
--tmp-dir option for archive creation
d706758 to
3bb71b5
Compare
b7edfa7 to
4a1c79d
Compare
|
Dogfooding:
I checked, and the temporary directories are indeed created in the desired locations. 👌 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @GeigerJ2! Very happy to have this feature, and that the default path is now not in /tmp. I think having the parent of the filename makes a lot more sense.
Before you look at my comments: one thing I'm wondering is if that is the default, would there be any reason to specify a different tmp_dir? If I look at the code in more detail, the only reason it is creating a temporary directory is because we don't want to leave behind a partially written archive. In the end, the tmp_filename is simply moved to the target filename with shutil.move, and then the temporary directory is deleted. The target directory must have enough space for this to work.
I think @giovannipizzi suggested to have a --tmp-dir option, so the user could specify it. But now that I see the code in more detail, I don't see the use case. I think just changing the default to the parent of the filename of the archive is fine.
| @click.option( | ||
| '--tmp-dir', | ||
| type=click.Path(exists=True, file_okay=False, dir_okay=True, writable=True, path_type=Path), | ||
| help='Directory to use for temporary files during archive creation. ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small concern here is that it sounds like the directory with the .aiida-export- prefix is only created when the parameter is not specified, and that in case it is the provided directory replaces that one. I wonder if the following text is not more clear:
Location where the temporary directory will be written during archive creation.
The directory must exist and be writable, and defaults to the parent directory of the output file.
This parameter is useful when the output directory has limited space or when you want to use a specific filesystem (e.g., faster storage) for temporary operations.
src/aiida/tools/archive/create.py
Outdated
| :param backend: the backend to export from. If not specified, the default backend is used. | ||
| :param tmp_dir: Directory to use for temporary files during archive creation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above.
| readonly_tmp.mkdir() | ||
|
|
||
| # Make directory read-only | ||
| readonly_tmp.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have use stat here? You should also be able to pass an octal (I figured out recently ^^):
readonly_tmp.chmod(0o444)
| if tmp_dir is not None: | ||
| tmp_dir = Path(tmp_dir) | ||
| if not tmp_dir.exists(): | ||
| EXPORT_LOGGER.warning(f"Specified temporary directory '{tmp_dir}' doesn't exist. Creating it.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behaviour is different from the CLI command, that does expect the temporary directory to exist. It also seems to conflict with the docstring of the function.
src/aiida/tools/archive/create.py
Outdated
| EXPORT_LOGGER.report(f'Creating archive with:\n{tabulate(count_summary)}') | ||
|
|
||
| # Handle temporary directory configuration | ||
| tmp_prefix = '.aiida-export-' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why define this variable, instead of just passing its value in the TemporaryDirectory constructor?
| 'label': d.link_label, | ||
| 'type': d.link_type, | ||
| try: | ||
| tmp_dir.mkdir(parents=True, exist_ok=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this line added here? Everything related to checking the tmp_dir input should be above, and for the default there should be no need to create the parent directory, the command should fail in case that doesn't exist.
| if e.errno == 28: # No space left on device | ||
| msg = ( | ||
| f"Insufficient disk space in temporary directory '{tmp_dir}'. " | ||
| f'Consider using --tmp-dir to specify a location with more available space.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this can also happen when the archive_create function is used directly:
| f'Consider using --tmp-dir to specify a location with more available space.' | |
| f'Consider using `tmp-dir` to specify a location with more available space.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still checking the code above in more detail, but what happens if there is enough space in the tmp_dir, but not in the target directory for the archive? Will the shutil.move command also fail with OSError with number 28? In that case, won't we be misleading the user?
|
|
||
| EXPORT_LOGGER.report('Finalizing archive creation...') | ||
|
|
||
| if filename.exists(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's not part you just put in the try-except block, but it's interesting. I'd move this logic to the lines after the comment on line 167 (#check/set archive file path), since that is dealing with the target filename of the archive existing and whether or not to overwrite it.
| if filename.exists(): | ||
| filename.unlink() | ||
|
|
||
| filename.parent.mkdir(parents=True, exist_ok=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, not your code, but something to think about: Apparently you can just write the archive to a folder that doesn't exist and the command will create it and all non-existent parents. I think that's ok. But the behaviour is then different from the tmp_dir one. I suppose there you don't want to just create some directory with a bunch of parents that you then have to clean up after?
Just for my curiosity, what were the problems of using |
Running out of disk space on the machine running AiiDA due to creation of the temporary archive file. |
3963661 to
23ca013
Compare
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
23ca013 to
a515785
Compare
|
@GeigerJ2 let me know when you want me to review this one again. Unless I am re-requested a review, I typically ignore open PRs that are changing. |
Hi @mbercx, thank you for checking on this again. I'd like to finalize the decision if we even keep the |
Adds a new
--tmp-dircommand line option toverdi archive createthat allows users to specify a custom directory for temporary files during archive creation.Changes:
--tmp-dirCLI option with validation (must exist and be writable)/tmp!Use cases:
Example: